Object Detection Loss

Fast RCNN

$L(p,u,t^u,v) = L_{cls}(p,u) + \lambda[u\geq 1]L_{loc}(t^u,v),$

where $p$ is the $(K+1)$-dimensional class probability vector with index 0 being the background class, $u$ is the ground-truth class, $v$ is the ground-truth regression tuple, and $t^u$ is the predicted regression tuple for class $u$. $L_{cls}$ is a multi-class softmax loss and $L_{loc}$ is a smooth L1 loss. The indicator $[u\geq 1]$ switches off the localization term for background RoIs.
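
As a rough illustration of the formula above, the per-RoI loss could be sketched as follows. The function names `smooth_l1` and `fast_rcnn_loss` are hypothetical, and the smooth L1 switch point of 1 is an assumption taken from the usual definition rather than from the text here.

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty: quadratic for |x| < 1, linear otherwise (assumed form)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """Loss for a single RoI (illustrative sketch, not a reference implementation).

    p   : list of K+1 class probabilities, index 0 = background
    u   : ground-truth class index
    t_u : predicted regression tuple for class u
    v   : ground-truth regression tuple
    lam : weight lambda on the localization term
    """
    l_cls = -math.log(p[u])  # softmax (log) loss on the true class
    if u >= 1:
        # [u >= 1] is true: the RoI is a foreground object, so regress its box
        l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))
    else:
        # background RoI: the localization term is switched off
        l_loc = 0.0
    return l_cls + lam * l_loc
```

For a background RoI (`u = 0`) only the classification term is returned, whatever the predicted box is.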

Faster RCNN

$L(p_i,t_i)=L_{cls} (p_i,p_i^*) + \lambda p_i^*L_{reg}(t_i,t_i^*),$

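where $L_{cls}$ is a two-class (object vs. not object) softmax loss for the RPN and a multi-class softmax loss for the detection network, and $L_{reg}$ is a smooth L1 loss. Here $p_i^*$ is the ground-truth label of anchor $i$ (1 for positive, 0 for negative), so the regression term is active only for positive anchors. The loss of Faster RCNN is therefore basically the same as that of Fast RCNN.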

Fast and Faster RCNN generate proposals, so they have positive/negative labels for anchor boxes. However, SSD and YOLO, described next, do not generate proposals, so they need to match anchor boxes with ground-truth boxes.
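
A common way to do such matching is by intersection-over-union (IoU). The sketch below is only illustrative: the helper names `iou` and `match_anchors`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are all assumptions, not details given in these notes.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_boxes, threshold=0.5):
    """For each anchor, return the index of its best ground-truth box, or -1.

    An anchor is matched only if its best IoU reaches `threshold`
    (the threshold value is an assumption for this sketch).
    """
    matches = []
    for a in anchors:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt_boxes):
            overlap = iou(a, g)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        matches.append(best_j if best_iou >= threshold else -1)
    return matches
```

Note that nothing prevents several anchors from being matched to the same ground-truth box; unmatched anchors (index -1) are treated as negatives.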

SSD

Let $x_{ij}^p$ be a binary indicator for matching the $i$-th default box to the $j$-th ground-truth box of category $p$. Multiple default boxes can be matched to the same ground-truth box.

$L(x,c,l,g)=L_{conf}(x,c) + \alpha L_{loc}(x,l,g),$

where $L_{conf}$ is a (K+1)-class softmax loss, and

$L_{loc}(x,l,g)=\sum_i \sum_j x_{ij}^p |l_i-g_j|.$
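
The two terms can be sketched in a few lines, following the formula as written in these notes. This is a simplified illustration under stated assumptions: the function name `ssd_loss` and its argument layout are hypothetical, the category index is folded into a `labels` list (one category per ground-truth box), each default box is assumed to match at most one ground-truth box, and unmatched default boxes are assumed to take the background class 0.

```python
import math

def ssd_loss(x, conf, labels, l, g, alpha=1.0):
    """Simplified SSD-style loss (illustrative sketch only).

    x      : x[i][j] = 1 if default box i is matched to ground-truth box j
    conf   : conf[i] = list of K+1 class probabilities for default box i
    labels : labels[j] = category (1..K) of ground-truth box j
    l      : l[i] = predicted box tuple for default box i
    g      : g[j] = ground-truth box tuple
    alpha  : weight on the localization term
    """
    l_conf, l_loc = 0.0, 0.0
    for i, row in enumerate(x):
        matched = [j for j, x_ij in enumerate(row) if x_ij]
        # Assumption: unmatched default boxes are trained as background (0).
        target = labels[matched[0]] if matched else 0
        l_conf += -math.log(conf[i][target])
        for j in matched:
            # Localization term x_ij * |l_i - g_j|, summed over coordinates.
            l_loc += sum(abs(a - b) for a, b in zip(l[i], g[j]))
    return l_conf + alpha * l_loc
```

Only matched default boxes contribute to the localization term, while every default box contributes to the confidence term.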

YOLO

$L = \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2 + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\sum_{c}(p_i(c)-\hat{p}_i(c))^2$

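Note that for boxes that contain no object (the noobj case), only one loss term is involved: the confidence term $(C_i-\hat{C}_i)^2$. The coordinate, size, and class terms are switched off.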
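
The way the indicators gate the individual terms can be made concrete with a small sketch. The function name `yolo_loss` and its data layout are hypothetical. The weights `lam_coord` and `lam_noobj` do not appear in the formula above; they are included as an assumption because the YOLO paper weights the coordinate and no-object terms, and they default to 1 so that the sketch reduces to the unweighted sum.

```python
import math

def yolo_loss(pred, target, obj, cls_pred, cls_target,
              lam_coord=1.0, lam_noobj=1.0):
    """Sum-of-squares YOLO-style loss (illustrative sketch only).

    pred[i][j], target[i][j] : (x, y, w, h, C) for box j in grid cell i
    obj[i][j]                : True if box j in cell i is responsible for an object
    cls_pred[i], cls_target[i] : class probability lists for cell i
    lam_coord, lam_noobj     : optional weights (assumed, default 1)
    """
    total = 0.0
    for i in range(len(pred)):
        for j in range(len(pred[i])):
            x, y, w, h, c = pred[i][j]
            tx, ty, tw, th, tc = target[i][j]
            if obj[i][j]:
                # center and size terms (square roots on width and height)
                total += lam_coord * ((x - tx) ** 2 + (y - ty) ** 2)
                total += lam_coord * ((math.sqrt(w) - math.sqrt(tw)) ** 2
                                      + (math.sqrt(h) - math.sqrt(th)) ** 2)
                # confidence and class terms
                total += (c - tc) ** 2
                total += sum((p - q) ** 2
                             for p, q in zip(cls_pred[i], cls_target[i]))
            else:
                # no-object box: only the confidence term contributes
                total += lam_noobj * (c - tc) ** 2
    return total
```

Changing the predicted coordinates or class scores of a no-object box leaves the result unchanged, which is the point made in the note above.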